RFC Semi-structured Data Types #4320
The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so several optimized binary JSON-like formats exist to improve parsing speed and single-key access.
For example, MongoDB and PostgreSQL use [BSON](https://bsonspec.org/) and [jsonb](https://www.postgresql.org/docs/14/datatype-json.html) respectively to store JSON data.
[UBJSON](https://ubjson.org/) is another binary JSON format specification; it aims to provide universal compatibility and to be as easy to use as JSON while being faster and more efficient.
All of these binary JSON formats perform better than plain JSON text; the only problem is that they lack good Rust implementations.
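To illustrate why a binary layout helps single-key access, here is a minimal sketch of a toy length-prefixed object encoding. It is not BSON, jsonb, or UBJSON; the layout and the names `encode`, `lookup`, and `read_u32` are assumptions made up for this example. The point is only that a reader can skip over values using stored lengths instead of parsing them.

```rust
// Toy layout (hypothetical, for illustration only):
//   u32 LE entry count
//   count x (u32 LE key length, u32 LE value length)
//   then, per entry: key bytes followed by value bytes

/// Read a little-endian u32 at `pos`, returning None if out of bounds.
fn read_u32(buf: &[u8], pos: usize) -> Option<usize> {
    let b = buf.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
}

/// Encode key/value pairs; values are opaque bytes in this sketch.
pub fn encode(entries: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for (key, value) in entries {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(&(value.len() as u32).to_le_bytes());
    }
    for (key, value) in entries {
        out.extend_from_slice(key.as_bytes());
        out.extend_from_slice(value);
    }
    out
}

/// Find one key by walking the length table; values that do not match
/// are skipped by length and never inspected.
pub fn lookup<'a>(buf: &'a [u8], key: &str) -> Option<&'a [u8]> {
    let count = read_u32(buf, 0)?;
    let mut data_pos = 4usize.checked_add(count.checked_mul(8)?)?;
    for i in 0..count {
        let key_len = read_u32(buf, 4 + i * 8)?;
        let val_len = read_u32(buf, 8 + i * 8)?;
        let key_end = data_pos.checked_add(key_len)?;
        let val_end = key_end.checked_add(val_len)?;
        if buf.get(data_pos..key_end)? == key.as_bytes() {
            return buf.get(key_end..val_end);
        }
        data_pos = val_end;
    }
    None
}
```

With a raw JSON string, the same lookup would have to tokenize every preceding value, including nested ones, before reaching the requested key.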
For further storage layout optimization, I think this paper also provides some insights: https://arxiv.org/pdf/2111.11517.pdf
ok, I'll look into this paper
```rust
#[derive(Clone)]
pub struct ObjectColumn<T: ObjectType> {
    // ... (the rest of the struct is outside this diff excerpt)
}
```
`ObjectType` can extend arrow2's `NativeType`, so `values` can be `Arc<Bytes<T>>`, which makes taking a slice zero-cost.
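As a rough illustration of the zero-cost slicing idea in this comment, the sketch below uses only the standard library. `SharedBuffer` is a hypothetical stand-in and is not arrow2's `Bytes` or `Buffer` API; it just shows that a slice can be a new offset and length over the same `Arc`-shared allocation.

```rust
use std::sync::Arc;

/// Hypothetical shared buffer view: shared storage plus an offset and a length.
#[derive(Debug)]
pub struct SharedBuffer<T> {
    data: Arc<Vec<T>>,
    offset: usize,
    len: usize,
}

impl<T> SharedBuffer<T> {
    /// Take ownership of the values and wrap them in shared storage.
    pub fn new(values: Vec<T>) -> Self {
        let len = values.len();
        Self { data: Arc::new(values), offset: 0, len }
    }

    /// Return a view of `len` elements starting at `offset` within this view.
    /// Only the reference count is incremented; no element is copied.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset <= self.len && len <= self.len - offset, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            len,
        }
    }

    /// Borrow the elements covered by this view.
    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }

    /// True if both views point at the same underlying allocation.
    pub fn shares_storage_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}
```

Slicing a slice composes the offsets, so nested views still refer to the original allocation.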
ok
- `Array`: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are `Variant` types.

Since `Object` and `Array` can each be regarded as a kind of `Variant`, the following discussion mainly uses `Variant` as the example.
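One way to picture this relationship is a recursive enum in which `Array` and `Object` are simply cases of `Variant`. This is a hypothetical in-memory model, not the column type proposed in the RFC; the scalar cases and the string-keyed map for `Object` are assumptions, and a plain `Vec` only models dense arrays.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of a semi-structured value (not Databend's actual type).
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    /// `Array`: positional values, each itself a `Variant`.
    Array(Vec<Variant>),
    /// `Object`: assumed here to be string keys mapped to `Variant` values.
    Object(BTreeMap<String, Variant>),
}

impl Variant {
    /// Look up a key; returns None unless this value is an `Object`.
    pub fn get_key(&self, key: &str) -> Option<&Variant> {
        match self {
            Variant::Object(fields) => fields.get(key),
            _ => None,
        }
    }

    /// Look up an index; returns None unless this value is an `Array`.
    pub fn get_index(&self, index: usize) -> Option<&Variant> {
        match self {
            Variant::Array(items) => items.get(index),
            _ => None,
        }
    }
}
```

Because every accessor is defined on `Variant`, code written against `Variant` also covers `Object` and `Array` values.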
Can you give an example of how users will use this feature, for instance by showing real SQL?
Take a look at https://github.com/datafuselabs/opendal/blob/main/docs/rfcs/0000-example.md#guide-level-explanation for reference.
I have added some examples, PTAL @Xuanwo
From a performance perspective, a better solution is to store data in a binary JSON-like format and extract some frequently queried unique keys as sub-columns.
However, to simplify development, the first version uses the JSON format.
The binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.
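The sub-column idea can be sketched roughly as follows. Everything here is hypothetical: `VariantColumn`, `Row`, and the choice of "hot" keys are invented for illustration, and each row is modeled as an already-parsed map of strings rather than raw JSON or a binary format.

```rust
use std::collections::BTreeMap;

/// Stand-in for one parsed semi-structured document (an assumption of this sketch).
pub type Row = BTreeMap<String, String>;

/// Full rows plus separately stored sub-columns for frequently queried keys.
pub struct VariantColumn {
    rows: Vec<Row>,
    sub_columns: BTreeMap<String, Vec<Option<String>>>,
}

impl VariantColumn {
    /// Build the column and materialize one sub-column per hot key.
    pub fn new(rows: Vec<Row>, hot_keys: &[&str]) -> Self {
        let mut sub_columns = BTreeMap::new();
        for key in hot_keys {
            // One slot per row; None marks rows that lack the key.
            let column: Vec<Option<String>> =
                rows.iter().map(|row| row.get(*key).cloned()).collect();
            sub_columns.insert((*key).to_string(), column);
        }
        Self { rows, sub_columns }
    }

    /// True if `key` was extracted into its own sub-column.
    pub fn has_sub_column(&self, key: &str) -> bool {
        self.sub_columns.contains_key(key)
    }

    /// Read a key for a row, using the sub-column when one exists and
    /// falling back to the full row otherwise.
    pub fn get(&self, row: usize, key: &str) -> Option<&str> {
        if let Some(column) = self.sub_columns.get(key) {
            return column.get(row)?.as_deref();
        }
        self.rows.get(row)?.get(key).map(String::as_str)
    }
}
```

Queries on a hot key then touch only a single column of values, while less common keys still resolve through the full document.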
Does this mean that in the first version, those columns will not be stored in a "column-oriented" way?
How about Dremel-style encoding of nested data structures (which Parquet supports out of the box)?
That's pretty complicated, and it seems to work only for columns with a fixed schema.
Parquet nested data types must define a schema, so they are not suitable for storing such arbitrary data.
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
Add RFC for Semi-structured data types design
Changelog
Related Issues
#3916
Test Plan
Unit Tests
Stateless Tests